Clustering Words with the MDL Principle 



Hang Li and Naoki Abe 

Theory NEC Laboratory, RWCP* 
c/o C&C Research Laboratories, NEC 
4-1-1 Miyazaki Miyamae-ku, Kawasaki, 216 Japan 
{lihang,abe}@sbl.cl.nec.co.jp






Abstract 

We address the problem of automati- 
cally constructing a thesaurus (hierarchi- 
cally clustering words) based on corpus 
data. We view the problem of cluster- 
ing words as that of estimating a joint 
distribution over the Cartesian product 
of a partition of a set of nouns and 
a partition of a set of verbs, and pro- 
pose an estimation algorithm using sim- 
ulated annealing with an energy func- 
tion based on the Minimum Descrip- 
tion Length (MDL) Principle. We em- 
pirically compared the performance of 
our method based on the MDL Principle 
against that of one based on the Max- 
imum Likelihood Estimator, and found 
that the former outperforms the latter. 
We also evaluated the method by con- 
ducting pp-attachment disambiguation 
experiments using an automatically con- 
structed thesaurus. Our experimental 
results indicate that we can improve ac- 
curacy in disambiguation by using such 
a thesaurus. 



1 Introduction 

Recently various methods for automatically con- 
structing a thesaurus (hierarchically clustering 
words) based on corpus data have been proposed
(Hindle, 1990; Brown et al., 1992; Pereira et al.,
1993; Tokunaga et al., 1995). The realization
of such an automatic construction method would
make it possible to a) save the cost of constructing 
a thesaurus by hand, b) do away with the subjec- 
tivity inherent in a hand made thesaurus, and c) 
make it easier to adapt a natural language pro- 
cessing system to a new domain. 



Although many of the proposed methods have 
proved to be effective, the word clustering prob- 
lem is still a problem which needs further investi- 
gation. In this paper, we propose a new method 
for automatic construction of thesauruses. Specif- 
ically, we view the problem of automatically clus- 
tering words as that of estimating a joint distribu- 
tion over the Cartesian product of a partition of 
a set of nouns (in general, any set of words) and 
a partition of a set of verbs (in general, any set 
of words), and propose an estimation algorithm 
using simulated annealing with an energy func- 
tion based on the Minimum Description Length 
(MDL) Principle. The MDL Principle is a well- 
motivated and theoretically sound principle for 
data compression and estimation from informa- 
tion theory and statistics. As a strategy of sta- 
tistical estimation, MDL is guaranteed to be near 
optimal. 

We empirically evaluated the effectiveness of 
our method. In particular, we compared the per- 
formance of our method based on the MDL Prin- 
ciple against that of one based on the Maximum 
Likelihood Estimator (MLE for short). We found 
that the MDL-based method performs better than 
the MLE-based method. We also evaluated our 
method by conducting structural (pp-attachment) 
disambiguation experiments using a thesaurus au- 
tomatically constructed by it and found that dis- 
ambiguation results can be improved. 

Since some words never occur in a corpus, and 
thus cannot be reliably classified by a method 
solely based on corpus data, we propose to com- 
bine the use of an automatically constructed the- 
saurus and that of a hand made thesaurus in dis- 
ambiguation. We conducted some experiments in 
order to test the effectiveness of this strategy. Our 
experimental results indicate that combining an 
automatically constructed thesaurus and a hand 
made thesaurus widens the 'coverage'^1 of disambiguation,
while maintaining high 'accuracy'^2.

^1 'Coverage' refers to the proportion (in percentage)

2 The Problem Setting 

Many of the methods of automatically construct- 
ing a thesaurus based on corpus data consist of the 
following three steps: (i) Extract co-occurrence 
data (e.g., case frame data, adjacency data) from 
a corpus, (ii) Starting from a single class (or each 
word composing its own class), divide (or merge) 
word classes based on the co-occurrence data us- 
ing some similarity (distance) measure. (The for- 
mer approach is called 'divisive,' the latter 'ag- 
glomerative.') (iii) Repeat step (ii) until some 
stopping condition is met, to construct a the- 
saurus tree. The method we propose here consists 
of the same three steps. 

Suppose available to us are data like those in 
Figure 1, which are co-occurrence data between
verbs and their objects, extracted from a corpus 
(step (i)). We then view the problem of clustering 
words as that of estimating a probabilistic model 
(representing a probability distribution) that gave 
rise to such data. We define the probabilistic 
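model in the following way. We first define a noun
partition $\mathcal{P}_\mathcal{N}$ over a given set of nouns $\mathcal{N}$ and a
verb partition $\mathcal{P}_\mathcal{V}$ over a given set of verbs $\mathcal{V}$. A
noun partition is any set $\mathcal{P}_\mathcal{N}$ satisfying
$\mathcal{P}_\mathcal{N} \subseteq 2^{\mathcal{N}}$,
$\bigcup_{C_i \in \mathcal{P}_\mathcal{N}} C_i = \mathcal{N}$ and
$\forall C_i, C_j \in \mathcal{P}_\mathcal{N}$ with $i \neq j$, $C_i \cap C_j = \emptyset$. A
verb partition $\mathcal{P}_\mathcal{V}$ is defined analogously. In this
paper, we call a member of a noun partition a
'noun cluster,' and a member of a verb partition a
'verb cluster.' We refer to a member of the Cartesian
product of a noun partition and a verb partition
($\in \mathcal{P}_\mathcal{N} \times \mathcal{P}_\mathcal{V}$) simply as a 'cluster.' We then
define a probabilistic model (or a joint distribution),
written $P(C_n, C_v)$, where random variable
$C_n$ assumes a value from a fixed noun partition
$\mathcal{P}_\mathcal{N}$, and $C_v$ a value from a fixed verb partition
$\mathcal{P}_\mathcal{V}$. Within a given cluster, we assume that each
element is generated with equal probability, i.e.,

$$\forall n \in C_n, \forall v \in C_v, \quad P(n, v) = \frac{P(C_n, C_v)}{|C_n \times C_v|} \qquad (1)$$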

Figure 2 exhibits two example models which might
have given rise to the data in Figure 1.

In this paper, we assume that the observed data 
are generated by a model belonging to the class of 
models just described, and select a model which 
best explains the data. As a result of this, we ob- 
tain both noun clusters and verb clusters. This 



of test data for which the disambiguation method can 
make a decision. 

^2 'Accuracy' refers to the success rate, given that
the disambiguation method makes a decision. 



problem setting is based on the intuitive assump- 
tion that similar words occur in the same context 
with roughly equal likelihood, as is made explicit 
in equation (1). Thus selecting a model which best
explains the given data is equivalent to finding the 
most appropriate classification of words based on 
their co-occurrence. 

3 Clustering with MDL 

We now turn to the question of what strategy 
(or criterion) we should employ in order to esti- 
mate the best model. Our choice is the Minimum
Description Length (MDL) Principle (Rissanen,
1978; Rissanen, 1983; Rissanen, 1984; Rissanen,
1986; Rissanen, 1989), a well-known principle of
data compression and estimation from informa- 
tion theory and statistics. MDL stipulates that 
the best probability model for given data is that 
model which requires the least code length for en- 
coding the model itself and the given data relative 
to it.^3 We refer to the code length for the model
as the 'model description length' and that for the
data the 'data description length.'

We apply MDL to the problem of estimating
a model consisting of a pair of partitions as described
above. In this context, a model with fewer
clusters, such as Model 2 in Figure 2, tends to be
simpler (in terms of the number of parameters),
but also tends to have a poorer fit to the data.
In contrast, a model with more clusters, such as
Model 1 in Figure 2, is more complex, but tends
to have a better fit to the data. Thus, there is a 
trade-off relationship between the simplicity of a 
model and the goodness of fit to the data. The 
model description length quantifies the simplicity 
(complexity) of a model, and the data descrip- 
tion length quantifies the goodness of fit to the 
data. According to MDL, the model which mini- 
mizes the sum total of the two types of description 
lengths should be selected. 

In what follows, we will describe in detail how 
to calculate the description length in our current 
context, as well as our simulated annealing algo- 
rithm based on MDL. 

3.1 Calculating Description Length 

We will now describe how to calculate the descrip- 
tion length for a model. Recall that each model 
is specified by the Cartesian product of a noun 
partition and a verb partition, and a number of 
parameters. Here we let $k_n$ denote the size of the
noun partition, and $k_v$ the size of the verb partition.
Then, there are $k_n \cdot k_v - 1$ free parameters
in a model.

^3 We refer the interested reader to (Li and Abe,
1995) for an explanation of the rationale behind using MDL
in natural language processing.

Figure 1: An example of co-occurrence data

Given a model $M$ and data $S$, its total description
length $L(M)$^4 is computed as the sum of the
model description length $L_{mod}(M)$, the parameter
description length $L_{par}(M)$, and the data description
length $L_{dat}(M)$ (we also sometimes refer
to $L_{mod}(M) + L_{par}(M)$ as the model description
length), namely,

$$L(M) = L_{mod}(M) + L_{par}(M) + L_{dat}(M). \qquad (2)$$

We employ the 'binary noun clustering method,'
in which $k_v$ is fixed at $|\mathcal{V}|$ and we are to decide
whether $k_n = 1$ or $k_n = 2$. This is as if we view
the nouns as entities and the verbs as features
and classify the entities based on their features.
Since there are $2^{|\mathcal{N}|}$ subsets of the set of nouns
$\mathcal{N}$, and for each binary noun partition we have
two different subsets (a special case of which is
when one subset is $\mathcal{N}$ and the other the empty set
$\emptyset$), the number of possible binary noun partitions
is $2^{|\mathcal{N}|}/2 = 2^{|\mathcal{N}|-1}$. Thus for each binary noun
partition we need $-\log \frac{1}{2^{|\mathcal{N}|-1}} = |\mathcal{N}| - 1$ bits^5 to
describe it.^6 Hence $L_{mod}(M)$ is calculated as^7

$$L_{mod}(M) = |\mathcal{N}| - 1. \qquad (3)$$
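As a quick sanity check on the counting argument above, the following sketch enumerates the binary partitions of a small noun set and confirms that there are $2^{|\mathcal{N}|-1}$ of them, so that $|\mathcal{N}| - 1$ bits are enough to index one. The noun list and the helper name `binary_partitions` are illustrative assumptions, not part of the method itself.

```python
import math
from itertools import combinations

def binary_partitions(nouns):
    """Enumerate unordered splits {A, B} of `nouns`, allowing B to be empty."""
    nouns = list(nouns)
    first, rest = nouns[0], nouns[1:]
    parts = []
    # Keep the first noun in A so that {A, B} and {B, A} are not counted twice.
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            a = {first, *extra}
            b = set(nouns) - a
            parts.append((a, b))
    return parts

nouns = ["wine", "beer", "bread", "rice"]  # hypothetical noun set
parts = binary_partitions(nouns)
bits = math.log2(len(parts))               # bits needed to index one partition
print(len(parts), bits)
```

Fixing one noun on a chosen side is what removes the factor of two in the count $2^{|\mathcal{N}|}/2$.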



^4 $L(M)$ depends on $S$, but we will leave $S$ implicit.

^5 Throughout the paper 'log' denotes the logarithm
to the base 2.

^6 For further explanation, see for example (Quinlan
and Rivest, 1989).

^7 The exact formulation of $L_{mod}(M)$ is subjective,
and it depends on the coding scheme used for the description
of the models.



$L_{par}(M)$ is calculated by

$$L_{par}(M) = \frac{k_n \cdot k_v - 1}{2} \cdot \log |S|, \qquad (4)$$

where $|S|$ denotes the data size, and $k_n \cdot k_v - 1$ is
the number of free parameters in the model. As is
well known, it is best to use $-\log \frac{1}{\sqrt{|S|}} = \frac{1}{2} \cdot \log |S|$
bits to describe each of the parameters, since the
standard deviation of the maximum likelihood estimation
of each parameter is of order $\frac{1}{\sqrt{|S|}}$, and
hence describing each parameter using more than
$\frac{1}{2} \cdot \log |S|$ bits would be wasteful for the estimation
accuracy possible with the given data size.
Finally, $L_{dat}(M)$ is calculated by

$$L_{dat}(M) = -\sum_{(n,v) \in S} f(n,v) \cdot \log P(n,v), \qquad (5)$$

where $f(n,v)$ denotes the total observed frequency
of noun verb pair $(n,v)$, and $P(n,v)$ the estimated
probability of $(n,v)$, which is calculated as follows:

$$\forall n \in C_n, \forall v \in C_v, \quad P(n,v) = \frac{P(C_n, C_v)}{|C_n \times C_v|}, \qquad (6)$$

$$P(C_n, C_v) = \frac{f(C_n, C_v)}{|S|}, \qquad (7)$$

where $f(C_n, C_v)$ denotes the observed frequency
of the noun verb pairs belonging to cluster
$(C_n, C_v)$.

With the description length for a model de- 
fined in the above manner, we wish to select a 
model having the minimum description length and 
output it as the result of clustering. Since the
model description length $L_{mod}$ is the same for each
model, in practice we only need to calculate and
compare $L'(M) = L_{par}(M) + L_{dat}(M)$.

Figure 2: Two example models

The description lengths for the data in Figure 1
using the two models in Figure 2 are shown in
Table 2. (Table 1 shows some values needed for
the calculation of the description length for Model
1.) These calculations indicate that according to
MDL, Model 1 should be selected over Model 2. 
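The calculation of $L_{par}$ and $L_{dat}$ in equations (4) to (7) can be sketched in a few lines of Python. Because the co-occurrence data of Figure 1 are not reproduced here, the counts, the noun and verb names other than the pair (rice, make), and the two candidate partitions below are assumptions, chosen only so that the cluster totals agree with the arithmetic in Table 2 ($|S| = 20$). The function name `description_length` is likewise hypothetical.

```python
import math
from collections import Counter

# Hypothetical counts (assumed): chosen to match the cluster totals in Table 2.
freq = Counter({
    ("wine", "drink"): 4, ("beer", "drink"): 4,
    ("bread", "eat"): 4, ("rice", "eat"): 4,
    ("wine", "make"): 1, ("beer", "make"): 1, ("bread", "make"): 2,
})
verb_clusters = [["eat"], ["drink"], ["make"]]      # k_v = |V|

def description_length(noun_clusters, verb_clusters, freq):
    """Return (L_par, L_dat) following equations (4)-(7)."""
    size = sum(freq.values())                        # |S|
    k = len(noun_clusters) * len(verb_clusters)      # k_n * k_v
    l_par = (k - 1) / 2 * math.log2(size)
    l_dat = 0.0
    for cn in noun_clusters:
        for cv in verb_clusters:
            f_c = sum(freq[(n, v)] for n in cn for v in cv)   # f(C_n, C_v)
            if f_c == 0:
                continue                             # no observed pair in this cluster
            p = f_c / size / (len(cn) * len(cv))     # equations (6) and (7)
            l_dat -= f_c * math.log2(p)              # all pairs in the cluster share p
    return l_par, l_dat

model1 = [["wine", "beer"], ["bread", "rice"]]       # assumed two-cluster model
model2 = [["wine", "beer", "bread", "rice"]]         # assumed one-cluster model
for name, model in (("Model 1", model1), ("Model 2", model2)):
    l_par, l_dat = description_length(model, verb_clusters, freq)
    print(f"{name}: L_par={l_par:.2f} L_dat={l_dat:.2f} L'={l_par + l_dat:.2f}")
```

Under these assumed counts the sketch gives $L' \approx 65.24$ for the two-cluster model and $L' \approx 74.76$ for the one-cluster model, in line with Table 2.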

3.2 A Simulated Annealing-based 
Algorithm 

We could in principle calculate the description 
length for the data using each model and select 
a model with the minimum description length, if 
computation time were of no concern. Since the 
number of probabilistic models under considera- 
tion is exponential, however, this is not feasible 
in practice. We employ the 'simulated annealing
technique' to deal with this problem. Figure 3
shows our (divisive) algorithm for hierarchical
word clustering.^8



4 Advantages of Our Method 

Although there have been many methods of word 
clustering proposed to date, their objectives appear
different. In Tables 3 and 4 we exhibit a
simple comparison between our work and related
work. Perhaps the method proposed by (Pereira
et al., 1993) is the most relevant in our context. In
(Pereira et al., 1993), they proposed a method of
'soft clustering,' in which each word can belong to
a number of distinct classes with certain probabilities.
Soft clustering has several desirable properties.
For example, word sense ambiguities in input
data can be resolved naturally. Here, we restrict
our attention to 'hard clustering' (i.e., each word
must belong to exactly one class), in part because
we are interested in comparing thesauruses constructed
by our method with existing hand-made
thesauruses. (Note that a hand made thesaurus is
based on hard clustering.^9)

We next elaborate on the merits of our method. 
In statistical natural language processing, usu- 
ally the number of parameters in a probabilistic 
model to be estimated is very large, and there- 
fore such a model is difficult to estimate with a 
reasonable data size that is available in practice. 
(This problem is usually referred to as the 'data 
sparseness problem.') We could smooth the estimated
probabilities using an existing smoothing
technique (e.g., (Dagan et al., 1992; Gale and
Church, 1990)), calculate some similarity measure
using the smoothed probabilities, and then cluster
words according to it. There is no guarantee,



^8 As we noted earlier, an alternative is to employ
an agglomerative algorithm.

^9 We wish to investigate the possibility of employing
MDL in soft clustering in the near future. Since
MDL is a general criterion for statistical estimation,
it can be used in other problem settings of word clustering.
For example, recently, Stolcke & Omohundro
proposed to use a Bayesian model merging technique
(Stolcke and Omohundro, 1994), which is similar to
MDL, for the problem of word clustering in the context
of estimating n-gram models proposed by (Brown
et al., 1992).



Table 1: Estimating parameters of Model 1

$|C_n \times C_v|$

Table 2: Description length for the models

| Model   | Term      | Calculation |
|---------|-----------|-------------|
| Model 1 | $L_{par}$ | $\frac{6-1}{2} \times \log 20 = 10.80$ |
| Model 1 | $L_{dat}$ | $-8 \times \log 0.2 - 8 \times \log 0.2 - 2 \times \log 0.05 - 2 \times \log 0.05 = 54.44$ |
| Model 1 | $L'$      | $10.80 + 54.44 = 65.24$ |
| Model 2 | $L_{par}$ | $\frac{3-1}{2} \times \log 20 = 4.32$ |
| Model 2 | $L_{dat}$ | $-8 \times \log 0.1 - 8 \times \log 0.1 - 4 \times \log 0.05 = 70.44$ |
| Model 2 | $L'$      | $4.32 + 70.44 = 74.76$ |



however, that the employed smoothing method is
in any way consistent with the clustering method
used subsequently. Our method based on MDL
resolves the clustering problem and the smoothing
problem in a unified fashion. (For example,
the probability of the noun verb pair (rice, make)
is estimated (smoothed) to be 0.05 in Model 1, although
the observed occurrence of it is 0 (see Figure 1
and Figure 2).) By employing models that
embody the assumption that words belonging to 
the same cluster occur with equal probability, our 
method achieves the smoothing effect as a side ef- 
fect of the clustering process, where the domains 
of smoothing coincide with the clusters obtained 
by clustering. Thus, the coarseness or fineness of 
clustering also determines the degree of smooth- 
ing. All of these effects fall out naturally as a 
corollary of the imperative of best possible esti- 
mation, the original motivation behind the MDL 
Principle. 



In our problem setting, we could alternatively
employ the Maximum Likelihood Estimator
(MLE) as criterion for estimation of the best
probabilistic model, instead of MDL. MLE, as
its name suggests, selects a model which maximizes
the likelihood of the data, i.e.,
$\hat{P} = \arg\max_P \prod_{x \in S} P(x)$. This is equivalent to minimizing
the data description length as defined in
Section 3, i.e., $\hat{P} = \arg\min_P \sum_{x \in S} -\log P(x)$.
We can see easily that MDL generalizes MLE,
in that it also takes into account the complexity
of the model itself. In the presence of models
with varying complexity, MLE tends to overfit
the data, and output a model that is too complex
and tailored to fit the specifics of the input
data. If we employ MLE as criterion for the estimation,
it will result in selecting a very fine model
with many small clusters, most of which will have
probabilities estimated as zero. Thus, in contrast
to employing MDL, it will not have the effect of
smoothing at all.

Purely as a strategy (criterion) of statistical
estimation as well, the superiority of MDL over
MLE is supported by convincing theoretical findings.
For instance, the speed of convergence of
the models selected by MDL to the true model
is known to be near optimal. (The models selected
by MDL converge to the true model approximately
at the rate of 1/s where s is the number
of parameters in the true model, whereas for MLE
the rate is 1/t, where t is the size of the domain,
or in our context, the total number of elements
of $\mathcal{N} \times \mathcal{V}$ (Barron and Cover, 1991; Yamanishi,
1992).) 'Consistency' is another desirable property
of MDL, which is not shared by MLE. That is,
the numbers of parameters in the models selected
by MDL converge to that of the true model (Rissanen,
1984). Both of these properties of MDL
are empirically verified in our present context, as
will be shown in the next section. In particular,
we have compared the performance of employing
an MDL-based simulated annealing against that
of one based on MLE in hierarchical word clustering.
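The contrast between the two criteria can be illustrated on a toy example. The sketch below scores three candidate noun partitions by the data description length alone (the MLE criterion) and by $L_{par} + L_{dat}$ (the MDL criterion). The co-occurrence counts, the word lists and the helper name `lengths` are hypothetical and serve only as an illustration under the uniform-within-cluster assumption.

```python
import math
from collections import Counter

# Hypothetical co-occurrence counts (assumed for illustration; |S| = 20).
freq = Counter({
    ("wine", "drink"): 4, ("beer", "drink"): 4,
    ("bread", "eat"): 4, ("rice", "eat"): 4,
    ("wine", "make"): 1, ("beer", "make"): 1, ("bread", "make"): 2,
})
verbs = ["eat", "drink", "make"]          # each verb forms its own cluster

def lengths(noun_clusters):
    """Return (L_dat, L_par + L_dat) for a noun partition."""
    size = sum(freq.values())
    l_par = (len(noun_clusters) * len(verbs) - 1) / 2 * math.log2(size)
    l_dat = 0.0
    for cn in noun_clusters:
        for v in verbs:
            f_c = sum(freq[(n, v)] for n in cn)       # cluster frequency
            if f_c:
                l_dat -= f_c * math.log2(f_c / size / len(cn))
    return l_dat, l_par + l_dat

candidates = {
    "one cluster": [["wine", "beer", "bread", "rice"]],
    "two clusters": [["wine", "beer"], ["bread", "rice"]],
    "four clusters": [["wine"], ["beer"], ["bread"], ["rice"]],
}
mle_choice = min(candidates, key=lambda name: lengths(candidates[name])[0])
mdl_choice = min(candidates, key=lambda name: lengths(candidates[name])[1])
print("MLE selects:", mle_choice)
print("MDL selects:", mdl_choice)
```

On these assumed counts, the data description length alone favors the finest partition, in which the unobserved pair (rice, make) receives probability zero, whereas adding the parameter description length favors the two-cluster partition.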



Algorithm: Clustering

1. Divide the noun set $\mathcal{N}$ into two subsets. Define a probabilistic model consisting of the
noun partition specified by the two subsets and the entire set of verbs.

2. do {

2.1 Randomly select one noun, remove it from the subset it belongs to and add it to the
other.

2.2 Calculate the description length for the two models (before and after the move) as $L_1$
and $L_2$, respectively.

2.3 Viewing the description length as the energy function for annealing, let $\Delta L = L_2 - L_1$.
If $\Delta L < 0$, fix the move, otherwise accept the move with probability $P = \exp(-\Delta L / T)$.

} while (the description length has decreased during the past $10 \cdot |\mathcal{N}|$ trials.)

Here $T$ is the annealing temperature whose initial value is 1 and updated to be $0.9T$ after
$10 \cdot |\mathcal{N}|$ trials.

3. If one of the obtained subsets is empty, then return the non-empty subset, otherwise
recursively apply Clustering on both of the two subsets.

Figure 3: Simulated annealing algorithm for hierarchical word clustering
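The procedure in Figure 3 can be sketched as follows. This is a minimal illustration rather than the authors' implementation, and it makes several assumptions: the initial split is random, the stopping test is implemented as "no new minimum was found during the last $10 \cdot |\mathcal{N}|$ trials", the best split seen is the one returned, the constant $L_{mod}$ is omitted from the energy, and the counts are hypothetical. All function names are illustrative.

```python
import math
import random
from collections import Counter

def desc_length(groups, verbs, freq):
    """L' = L_par + L_dat for the non-empty noun groups; each verb is its own cluster."""
    groups = [g for g in groups if g]
    total = sum(freq[(n, v)] for g in groups for n in g for v in verbs)
    if total == 0:
        return 0.0                      # no data for these nouns (assumed convention)
    l_par = (len(groups) * len(verbs) - 1) / 2 * math.log2(total)
    l_dat = 0.0
    for g in groups:
        for v in verbs:
            f_c = sum(freq[(n, v)] for n in g)
            if f_c:
                l_dat -= f_c * math.log2(f_c / total / len(g))
    return l_par + l_dat

def anneal_split(nouns, verbs, freq, rng):
    """Step 2 of Figure 3: search for a binary split with small description length."""
    side = {n: rng.random() < 0.5 for n in nouns}        # step 1: arbitrary initial split
    def split(s):
        return [n for n in nouns if s[n]], [n for n in nouns if not s[n]]
    cur = desc_length(split(side), verbs, freq)
    best, best_side = cur, dict(side)
    temp, improved = 1.0, True
    while improved:                                       # stop after a round with no new minimum
        improved = False
        for _ in range(10 * len(nouns)):
            n = rng.choice(nouns)
            side[n] = not side[n]                         # 2.1: move one noun across
            new = desc_length(split(side), verbs, freq)   # 2.2: length after the move
            delta = new - cur                             # 2.3: energy difference
            if delta < 0 or rng.random() < math.exp(-delta / temp):
                cur = new
                if cur < best - 1e-12:
                    best, best_side, improved = cur, dict(side), True
            else:
                side[n] = not side[n]                     # rejected: undo the move
        temp *= 0.9                                       # cool after each round
    return split(best_side)

def cluster(nouns, verbs, freq, rng):
    """Step 3: recurse until one side is empty. Leaves are lists, internal nodes are tuples."""
    nouns = list(nouns)
    if len(nouns) <= 1:
        return nouns
    a, b = anneal_split(nouns, verbs, freq, rng)
    if not a or not b:
        return a or b
    return cluster(a, verbs, freq, rng), cluster(b, verbs, freq, rng)

# Hypothetical counts, for illustration only.
freq = Counter({("wine", "drink"): 4, ("beer", "drink"): 4, ("bread", "eat"): 4,
                ("rice", "eat"): 4, ("wine", "make"): 1, ("beer", "make"): 1,
                ("bread", "make"): 2})
nouns = ["wine", "beer", "bread", "rice"]
verbs = ["eat", "drink", "make"]
tree = cluster(nouns, verbs, freq, random.Random(0))
print(tree)
```

In each recursive call the description length is computed from the counts of the nouns in the current subset only, which is one possible reading of "recursively apply Clustering"; the figure does not spell this detail out.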



Table 3: Comparison to related work

|            | objective | co-occurrence data |
|------------|-----------|--------------------|
| Hindle90   | word classification | case frame data |
| Brown92    | n-gram model estimation | adjacency data |
| Pereira93  | structural and word sense disambiguation | case frame data |
| Tokunaga95 | structural disambiguation, thesauruses for different slots | case frame data |
| This paper | structural disambiguation, automatically constructed thesaurus v.s. hand-made thesaurus | case frame data |



5 Experimental Results 

We describe our experimental results in this sec- 
tion. 

5.1 Experiment 1: MDL v.s. MLE 

As described in the previous section, there are 
some theoretical findings verifying that employ- 
ing MDL performs better than employing MLE in 
statistical estimation. We empirically test if this 
is the case in our current context. We artificially 
constructed a true model of word co-occurrence 
(see Figure 4), and then generated data according
to its distribution. We then used the data to estimate
a model (hierarchically cluster words) by
employing MDL and MLE, respectively. (The algorithm
used for MLE was the same as that shown
in Figure 3, except the data description length replaces
the total description length in Step 2.) We
evaluated the two methods in terms of the number
of noun clusters and the KL distance.^10 Figure 5(a)
plots the relation between the number of
obtained noun clusters (leaf nodes in the obtained
thesaurus tree) versus the data size, averaged over
10 trials. (Note that the number of noun clusters
in the true model is 4.) Figure 5(b) plots
the KL distance versus the data size, also averaged
over the same 10 trials. The results indicate
that MDL converges to the true model faster than
MLE. Also, MLE tends to select a model that is
too large (overfitting the data), while MDL tends
to select a model which is simple and yet fits the
data reasonably well. We conducted the same simulation
experiments for some other models and
found the same tendencies.^11 We conclude that
it is better to employ MDL than MLE, as a criterion
in simulated annealing-based hierarchical
word clustering.

^10 The KL distance (relative entropy), which is
widely used in information theory and statistics, is
a measure of 'distance' between two distributions
(Cover and Thomas, 1991). It is always non-negative
and is zero iff the two distributions are identical, but is
asymmetric and hence not a metric (the usual notion
of distance).

^11 The models we constructed were small as a model
of word clustering with practical significance. We believe,
however, that the convergence characteristics of
the estimation methods for these models should carry
over to the cases of estimating more practical, larger
models.

Table 4: Comparison to related work

|            | strategy | algorithm |
|------------|----------|-----------|
| Hindle90   | heuristics | |
| Brown92    | maximizing likelihood | agglomerative, hard clustering |
| Pereira93  | minimizing free energy | divisive, soft clustering |
| Tokunaga95 | maximizing classification probability | agglomerative, hard clustering |
| This paper | minimizing description length | divisive, hard clustering |

Figure 4: An artificial model
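The KL distance used as an evaluation measure in Experiment 1 can be computed as in the sketch below, which assumes that both distributions are given as dictionaries over the same finite set of outcomes and that logarithms are taken to base 2, as elsewhere in the paper. The two example distributions and the function name `kl_distance` are hypothetical.

```python
import math

def kl_distance(p, q):
    """D(p || q) in bits for two discrete distributions given as dicts."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue                 # a term with p(x) = 0 contributes nothing
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf          # q gives zero mass to an outcome p can produce
        total += px * math.log2(px / qx)
    return total

p = {"a": 0.5, "b": 0.5}             # hypothetical 'true' distribution
q = {"a": 0.25, "b": 0.75}           # hypothetical estimate
d_pq = kl_distance(p, q)
d_qp = kl_distance(q, p)
d_pp = kl_distance(p, p)
print(d_pq, d_qp, d_pp)
```

The two directions give different values, which illustrates the asymmetry noted in the footnote, and the distance of a distribution from itself is zero.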



5.2 Experiment 2: Qualitative Evaluation 

We extracted roughly 180,000 case frames from
the bracketed Wall Street Journal (WSJ) corpus
of the Penn Tree Bank (Marcus et al., 1993) as
co-occurrence data. We then constructed a number
of thesauruses based on these data, using our
method. Figure 6 shows an example thesaurus
for the 20 most frequently observed nouns in the
data, constructed based on their appearances as
subjects and objects of roughly 2000 verbs. The
obtained thesaurus seems to agree with human
intuition to some degree. For example, 'million'
and 'billion' are classified in one noun cluster, and
'stock' and 'share' are classified together. Not all
of the noun clusters, however, seem to be meaningful
in the useful sense. This general tendency
is also observed in other example thesauruses obtained
by our method. Pragmatically speaking,
however, whether the obtained thesaurus agrees
with our intuition in itself is only of secondary
concern, since the main purpose is to use the constructed
thesaurus to help improve on a disambiguation
task.

Figure 5: (a) Number of noun clusters v.s. data size and (b) KL distance v.s. data size

Figure 6: An example thesaurus
(leaf labels recovered from the figure: one, market, year; investor, official, bank;
sale, loss; rate, price; stock, share; billion, million)

5.3 Experiment 3: Disambiguation

We also evaluated our method by using a constructed
thesaurus in pp-attachment disambiguation
experiments.

We used as training data the same 180,000 case
frames used in Experiment 2. We also extracted as
our test data 172 $(verb, noun_1, prep, noun_2)$ patterns
from the data in the same corpus, which
were not used in the training data. For the
150 words that appear in the position of $noun_2$,
we constructed a thesaurus based on the co-occurrences
between heads and slot values of the
frames in the training data. This is because in
our disambiguation experiments we only need a
thesaurus consisting of these 150 words. We then
applied the learning method proposed in (Li and
Abe, 1995) to learn case frame patterns using
the constructed thesaurus with the same training
data as input. We formalize the case frame
patterns as conditional distributions of the form
$P(Class \mid head, prep)$, where $Class$ varies over the
internal nodes in a certain 'cut' in the thesaurus
tree.^12 Our method selects the optimal cut in the
thesaurus tree using the given data in the sense
of MDL, that is, a cut which is fine enough to
capture the tendency in the input data, but is
coarse enough to have a reasonably small number
of parameters to estimate. It also estimates
$P(Class \mid head, prep)$ for each $Class$ in the cut (see
(Li and Abe, 1995) for further detail). Table 5
shows some example case frame patterns obtained
by this method, and Figure 7 shows the leaf nodes
dominated by the internal nodes appearing in the
case frame patterns of Table 5.

^12 A 'cut' in a thesaurus tree defines a partition over
the set of nouns appearing in the thesaurus.



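Table 5: Examples of case frame patterns

| input | freq. |
|-------|-------|
| question about attitude | 1 |
| question about corporation | 1 |
| question about strength | 2 |

| case frame pattern | prob. |
|--------------------|-------|
| $P((\text{strength}) \mid \text{question}, \text{about})$ | 0.50 |
| $P((\#80) \mid \text{question}, \text{about})$ | 0.25 |
| $P((\#122) \mid \text{question}, \text{about})$ | 0.25 |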



Table 6: PP-attachment disambiguation results

|               | coverage(%) | accuracy(%) |
|---------------|-------------|-------------|
| Base-Line     | 100         | 70.2        |
| Word-Based    | 19.7        | 95.1        |
| MDL-Thesaurus | 33.1        | 93.0        |
| MLE-Thesaurus | 33.7        | 89.7        |
| WordNet       | 49.4        | 88.2        |



We then conducted pp-attachment disambiguation
experiments. We compare $P(noun_2 \mid verb, prep)$ and
$P(noun_2 \mid noun_1, prep)$, which are calculated based
on the case frame patterns, to determine the attachment
site of $(prep, noun_2)$. More specifically,
if the former is larger than the latter, we attach
it to $verb$; if the latter is larger than the former,
we attach it to $noun_1$; and otherwise (including
the case in which both are 0), we conclude
that we cannot make a decision. Table 6 shows



#80:

ground, wake, success, network, game, rest, art, organization, plane, output,
television, benefit, letter, holder, support, nation, corporation, review,
thousand, manufacturer, margin, man, meeting, customer, agent, help

#122:

reorganization, attitude, relief, competition, constitution

Figure 7: Internal nodes and the leaf nodes they dominate



the results of the experiments in terms of 'coverage'
and 'accuracy.' Here 'coverage' refers to the
proportion (in percentage) of the test patterns on
which the disambiguation method can make a decision.
'Base-Line' refers to the method of always
attaching $(prep, noun_2)$ to $noun_1$. 'Word-Based,'
'MLE-Thesaurus,' and 'MDL-Thesaurus' respectively
stand for using word-based estimates, using
a thesaurus constructed by employing MLE,
and using a thesaurus constructed by our method.
Note that the coverage of 'MDL-Thesaurus' is
substantially higher than that of 'Word-Based'
(33.1% versus 19.7%), while accuracy remains
high (though it drops somewhat, from 95.1% to
93.0%), indicating that using an automatically
constructed thesaurus can improve disambiguation
results in terms of coverage.
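The decision rule and the two evaluation measures described above can be sketched as follows. The probability pairs and gold attachments are made up for illustration, and the function names `attach` and `evaluate` are hypothetical; only the comparison rule and the definitions of coverage and accuracy are taken from the text.

```python
def attach(p_verb, p_noun1):
    """Decide the attachment site from P(noun2 | verb, prep) and P(noun2 | noun1, prep)."""
    if p_verb > p_noun1:
        return "verb"
    if p_noun1 > p_verb:
        return "noun1"
    return None                      # equal (including both 0): no decision

def evaluate(decisions, gold):
    """Return (coverage %, accuracy %); accuracy is measured on decided cases only."""
    decided = [(d, g) for d, g in zip(decisions, gold) if d is not None]
    coverage = 100.0 * len(decided) / len(gold)
    correct = sum(d == g for d, g in decided)
    accuracy = 100.0 * correct / len(decided) if decided else 0.0
    return coverage, accuracy

# Hypothetical probability pairs and gold attachments, for illustration only.
scores = [(0.30, 0.10), (0.00, 0.00), (0.05, 0.20), (0.40, 0.10)]
gold = ["verb", "noun1", "noun1", "noun1"]
decisions = [attach(pv, pn) for pv, pn in scores]
coverage, accuracy = evaluate(decisions, gold)
print(decisions, coverage, accuracy)
```

Because accuracy is computed only over the patterns for which a decision is made, a method can trade coverage for accuracy, which is the pattern visible in Table 6.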



We also tested the case of using an existing thesaurus
(instead of an automatically constructed
thesaurus) to learn case frames. In particular,
we used this method with WordNet (Miller
et al., 1993) and the same training data, and
then conducted a pp-attachment disambiguation
experiment using the obtained case frame patterns.
We represent the result of this experiment
as 'WordNet' in Table 6. We can see
that in terms of coverage, WordNet outperforms
MDL-Thesaurus, but in terms of accuracy, MDL-Thesaurus
outperforms WordNet. These results
can be interpreted as follows: An automatically
constructed thesaurus is more domain dependent
and therefore captures the domain dependent features
better, and thus using it achieves high accuracy.
On the other hand, since the training data
we had available are insufficient, its coverage is
smaller than that of a hand made thesaurus. In
practice, it makes sense to combine both types
of thesauruses. That is, an automatically constructed
thesaurus can be used within its coverage,
and outside its coverage, a hand made thesaurus
can be used. Given the current state of
the word clustering technique (namely, it requires
data size that is usually not available, and it tends
to be computationally demanding), this strategy
is practical. We tested this strategy. More
specifically, we compare $P(noun_2 \mid verb, prep)$ and
$P(noun_2 \mid noun_1, prep)$ calculated from case frame
patterns obtained using an automatically constructed
thesaurus; when the two probabilities are
equal, including the case in which both are 0,
we compare the probabilities calculated from case
frame patterns obtained using WordNet. Table 7
represents the result of this combined method as
'MDL-Thesaurus + WordNet.' The experimental
result indicates that employing the combined
method does increase the coverage of disambiguation.
We also tested 'MDL-Thesaurus + WordNet
+ LA + Default,' which stands for using the constructed
thesaurus and WordNet first, then the
lexical association value proposed by (Hindle and
Rooth, 1991), and finally the default (i.e., always
attaching $(prep, noun_2)$ to $noun_1$). Figure 8 shows
the results. Our best disambiguation result obtained
using this last combined method slightly
improves the accuracy reported in (Li and Abe,
1995) (84.3%).
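The combined strategy amounts to a back-off cascade: each knowledge source is consulted in turn and the first one that can make a decision wins, with the default attachment as the last resort. The sketch below illustrates this under stated assumptions: the two probability tables are hypothetical stand-ins for the models learned with the constructed thesaurus and with WordNet, the lexical association stage is omitted for brevity, and the names `from_probs` and `disambiguate` are illustrative.

```python
def from_probs(table):
    """Build a stage from a dict: pattern -> (P(noun2|verb,prep), P(noun2|noun1,prep))."""
    def stage(pattern):
        p_verb, p_noun1 = table.get(pattern, (0.0, 0.0))
        if p_verb > p_noun1:
            return "verb"
        if p_noun1 > p_verb:
            return "noun1"
        return None                  # tie, including both 0: pass to the next stage
    return stage

def disambiguate(pattern, stages, default="noun1"):
    """Try each stage in order; fall back to the default attachment."""
    for stage in stages:
        decision = stage(pattern)
        if decision is not None:
            return decision
    return default

# Hypothetical tables standing in for the two thesaurus-based models.
mdl_table = {("eat", "pizza", "with", "fork"): (0.4, 0.1)}
wordnet_table = {("see", "man", "with", "telescope"): (0.3, 0.1)}
stages = [from_probs(mdl_table), from_probs(wordnet_table)]

r1 = disambiguate(("eat", "pizza", "with", "fork"), stages)      # decided by the first stage
r2 = disambiguate(("see", "man", "with", "telescope"), stages)   # decided by the second stage
r3 = disambiguate(("buy", "car", "for", "son"), stages)          # no stage decides: default
print(r1, r2, r3)
```

A further stage, such as one based on lexical association values, could be added to the `stages` list in the same way.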



6 Concluding Remarks 

We have proposed a method of automatically 
constructing a thesaurus (hierarchically clustering 
words) based on corpus data. We conclude with 
the following remarks. 

1. Our method of hierarchically clustering words
based on the MDL Principle is theoretically 
sound. Our experimental results indicate 
that it is better to employ MDL than MLE 
as estimation criterion in hierarchical word 
clustering. 

2. Using a thesaurus constructed by our method 
can improve pp-attachment disambiguation 
results. 

3. Given the current state of the art in statistical 
natural language processing, it is best to use a 
combination of an automatically constructed 
thesaurus and a hand made thesaurus for disambiguation
purposes. The disambiguation
accuracy obtained this way was 85.5%. 



Table 7: PP-attachment disambiguation results

accuracy(%)

MDL-Thesaurus + WordNet
MDL-Thesaurus + WordNet + LA + Default

Figure 8: Accuracy v.s. coverage



In the future, hopefully with larger training 
data sizes, we plan to construct larger thesauruses 
as well as to test other clustering algorithms.